Reading recommendation
Since I am very methodical, this notebook is well structured, very detailed, and quite long. I warmly recommend reading it with an interactive table-of-contents side panel for convenient navigation. Jupyter with jupyter_contrib_nbextensions supports this and many more useful perks (I warmly recommend it!), but these extensions are unsupported on GitHub and in a plain Jupyter installation, so -
Ways to view this notebook with an interactive navigation side-panel:
Author: oz.livneh@gmail.com
DatingAI is a personal research project that I started for my interest, challenge and experience, aiming to:
The architecture presented in this notebook predicts for each image the user score value given to the dating profile to which the image belongs. This assumes that all images in each profile are independent and given the same score (of their profile), which is obviously a simplification.
Why regression suits this better than classification: instead of predicting the image score class, this architecture predicts the image score value - a regression task - so that the distance between scores is reflected in the loss. For example, for a target score of +3, predicting -1 (err=4) should be penalized much more than predicting +2 (err=1), whereas in classification both mistakes incur the same penalty.
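A toy plain-Python illustration of this asymmetry (using the example values from the text above; not part of the training code):

```python
# Toy illustration: squared error grows with the distance from the target,
# while a 0/1 classification loss treats every wrong prediction the same.
target = 3.0
predictions = [-1.0, 2.0]  # the two example predictions from the text

squared_errors = [(p - target) ** 2 for p in predictions]  # regression penalty
class_errors = [int(p != target) for p in predictions]     # 0/1 misclassification

print(squared_errors)  # [16.0, 1.0] -> predicting -1 is penalized 16x harder
print(class_errors)    # [1, 1]      -> both mistakes cost the same
```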
net_architecture options:
- 'my simple CNN' - a (very) simple CNN I wrote mainly for debugging.
- 'inception v3'
- 'resnet18'

freeze_pretrained_net_weights options:
- True: weights of pretrained models are left untrained except for the last layer.
- False: weights of pretrained models are trained entirely, starting from pretrained values.

In this demo I use Inception v3 with freeze_pretrained_net_weights=False.
Each row in profiles_df, the output of script_Personal_cupid_scraper.py, represents a profile whose 'image filenames' column holds a nested list of 0 or more image filenames. Therefore the dataset used here is based on unnested_images_df, created by unnesting the image filenames from profiles_df such that each row in unnested_images_df represents a single image, with the remaining columns taken from the profile to which the image belongs (see # Processing data, unnesting images).
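For illustration only, the unnesting idea can be sketched on hypothetical toy data with pandas' DataFrame.explode (available from pandas 0.25; the notebook itself uses an explicit loop, which also works on earlier versions):

```python
import pandas as pd

# Hypothetical toy profiles frame: each row holds a nested list of filenames.
profiles = pd.DataFrame({
    'profile id': [101, 102],
    'score (levels=7)': [3, -1],
    'image filenames': [['a.jpg', 'b.jpg'], ['c.jpg']],
})

# explode() replicates each row once per list element -> one row per image,
# with the profile columns duplicated for every image of that profile.
unnested = (profiles.explode('image filenames')
            .rename(columns={'image filenames': 'image filename'})
            .reset_index(drop=True))
print(len(unnested))  # 3
```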
import torch
print('torch\t\ttested on 1.1\t\tcurrent:',torch.__version__)
import sys
print('Python\t\ttested on 3.6.5\t\tcurrent:',sys.version[:sys.version.find(' ')])
import numpy as np
print('numpy\t\ttested on 1.16.3\tcurrent:',np.__version__)
import pandas as pd
print('pandas\t\ttested on 0.24.0\tcurrent:',pd.__version__)
import matplotlib
print('matplotlib\ttested on 3.0.3\t\tcurrent:',matplotlib.__version__)
#--------- general -------------
debugging=True # executes the debugging sections, prints their results
# debugging=False
torch_manual_seed=0 # integer or None for no seed; for torch reproducibility, as much as possible
#torch_manual_seed=None
#--------- data -------------
data_folder_path=r'D:\My Documents\Dropbox\Python\DatingAI\Data'
images_folder_path=r'D:\My Documents\Dropbox\Python\DatingAI\Data\Images'
df_pickle_file_name='profiles_df.pickle'
"""
dataset downsampling (samples = images):
if max_dataset_length>0: builds a dataset by sampling only
max_dataset_length samples from all available data.
requires user approval!
if max_dataset_length<=0: not restricting dataset length - using all
available data
"""
max_dataset_length=1000
# max_dataset_length=0
seed_for_dataset_downsampling=0 # integer or None for no seed; for sampling max_dataset_length samples from dataset
"""
random_transforms - data augmentation by applying random transforms
(random crop, horizontal flip, color jitter etc.) defined at
# Building a PyTorch dataset of images with transforms
"""
# random_transforms='train & val' # data augmentation on both train and val phases
random_transforms='train' # data augmentation only on train phase, validation is free of random transforms
# random_transforms='none' # no data augmentation
load_all_images_to_RAM=False # default; loads images from the hard drive for each sample in the batch via PyTorch's efficient (multi-processing) dataloader
# load_all_images_to_RAM=True # loads all dataset images to RAM; estimates dataset size and requires user approval
validation_ratio=0.5 # validation dataset ratio from total dataset length
batch_size_int_or_ratio_float=8 # if int: each batch will contain this number of samples
#batch_size_int_or_ratio_float=1e-2 # if float: batch_size=round(batch_size_int_or_ratio_float*dataset_length)
data_workers=0 # 0 means no multiprocessing in dataloaders
#data_workers='cpu cores' # sets data_workers=multiprocessing.cpu_count()
shuffle_dataset_indices_for_split=True # dataset indices for dataloaders are shuffled before splitting to train and validation indices
#shuffle_dataset_indices_for_split=False
dataset_shuffle_random_seed=0 # numpy seed for sampling the indices for the dataset, before splitting to train and val dataloaders
#dataset_shuffle_random_seed=None
dataloader_shuffle=True # samples are shuffled inside each dataloader, on each epoch
#dataloader_shuffle=False
#--------- net -------------
# architecture_is_a_pretrained_model=False
# net_architecture='my simple CNN'
architecture_is_a_pretrained_model=True
net_architecture='inception v3'
#net_architecture='resnet18'
# freeze_pretrained_net_weights=True # freezes pretrained model weights except the last layer
freeze_pretrained_net_weights=False # trains pretrained models entirely, all weights, starting from pretrained values
loss_name='MSE'
#--------- training -------------
train_model_else_load_weights=True
#train_model_else_load_weights=False # instead of training, loads a pre-trained model and uses it
force_train_evaluation_after_each_epoch=True # adding evaluation of the training dataset after each epoch finishes training
# force_train_evaluation_after_each_epoch=False # default
epochs=15
learning_rate=2e-4
optimizer_name='SGD'
SGD_momentum=0.7 # default: 0.9
# optimizer_name='Adam'
Adam_betas=(0.7,0.999) # default: (0.9,0.999)
lr_scheduler_decay_factor=0.9 # applies to all optimizers; on each lr_scheduler_step_size epochs, learning_rate*=lr_scheduler_decay_factor
lr_scheduler_step_size=1
best_model_criterion='min val epoch MSE' # criterion for choosing best net weights during training as the final weights
return_to_best_weights_in_the_end=True # when training completes, loads the weights of the best net, defined by best_model_criterion
#return_to_best_weights_in_the_end=False
period_in_seconds_to_log_loss=30 # <=0 means no logging during training, else: inter-epoch logging and reporting loss and metrics during training
#plot_realtime_stats_on_logging=True # incomplete implementation!
plot_realtime_stats_on_logging=False
#plot_realtime_stats_after_each_epoch=True
plot_realtime_stats_after_each_epoch=False
#plot_loss_in_log_scale=True
plot_loss_in_log_scale=False
#offer_mode_saving=True # offer model weights saving ui after training (only if train_model_else_load_weights=True)
offer_mode_saving=False
models_folder_path=r'D:\My Documents\Dropbox\Python\DatingAI\Data\Saved Models' # raw string, as for the paths above, so backslashes are not treated as escapes
I could hide all my function and class definitions in another script, but I want this notebook to be clear and self-contained. Also, GitHub now supports jumping to definitions!
import logging
logging.basicConfig(format='%(asctime)s %(funcName)s (%(levelname)s): %(message)s',
datefmt='%Y-%m-%d %H:%M:%S')
logger=logging.getLogger('main logger')
logger.setLevel(logging.INFO)
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import PIL
import random
from time import time
import copy
import multiprocessing
if data_workers=='cpu cores':
data_workers=multiprocessing.cpu_count()
import torch
if torch_manual_seed is not None: # None means no seed, per the comment in the parameters section
    torch.manual_seed(torch_manual_seed)
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import DatingAI
if torch_manual_seed is not None:
    DatingAI.torch.manual_seed(torch_manual_seed)
device=DatingAI.device
# a charm for interactive plotting in Jupyter notebook (useful for zooming, rotating 3D plots):
%matplotlib notebook
Data structure is demonstrated in # 1.4 Data structure.
profiles_df_path=os.path.join(data_folder_path,df_pickle_file_name)
profiles_df=pd.read_pickle(profiles_df_path)
# finding score column (column name in the format of 'score (levels=%d)'%score_levels)
score_column_name=None
for column in profiles_df.columns:
if 'score' in column:
score_column_name=column
break
if score_column_name is None:
raise RuntimeError("no existing column name in profiles_df contains 'score'!")
# unnesting images
unnested_images_dict={}
for profile_index in range(len(profiles_df)):
row_series=profiles_df.iloc[profile_index,:]
profile_id=row_series['profile id']
profile_score=row_series[score_column_name]
for filename in row_series['image filenames']:
if filename=='pq_400.pn': # skipping this blank profile image (in a strange format)
continue
if os.path.isfile(os.path.join(images_folder_path,filename)):
unnested_images_dict.update({len(unnested_images_dict):{
'profile index':profile_index,
'profile id':profile_id,
'score':profile_score,
'image filename':filename}})
else:
logger.warning(f'profile {profile_index}: image {filename} not found -> skipping image!')
unnested_images_df=pd.DataFrame.from_dict(unnested_images_dict,orient='index')
if max_dataset_length>0 and max_dataset_length<len(unnested_images_df):
user_data_approval=input('ATTENTION: downsampling is chosen - building a dataset by sampling only max_dataset_length=%d samples from all available data! approve? y/[n] '%(round(max_dataset_length)))
if user_data_approval!='y':
raise RuntimeError('user did not approve dataset max_dataset_length sampling!')
random.seed(seed_for_dataset_downsampling)
sampled_indices=random.sample(range(len(unnested_images_df)),max_dataset_length)
unnested_images_df=unnested_images_df.iloc[sampled_indices]
logger.info('completed unnesting images from profiles_df to unnested_images_df of length %d'%(len(unnested_images_df)))
image_num_to_sample=5
# end of inputs ---------------------------------------------------------------
if debugging:
logger.info('checking image shapes of %d sampled images'%image_num_to_sample)
sampled_indices_list=random.sample(range(len(unnested_images_df)),image_num_to_sample)
for i in sampled_indices_list:
df_row=unnested_images_df.iloc[i,:]
image_filename=df_row['image filename']
image_array=plt.imread(os.path.join(images_folder_path,image_filename))
print(f'{image_filename} shape:',image_array.shape)
In this section the transforms are defined (not all are random; some are required independently of data augmentation), then a dataset is built and tested. Only later are the training and validation datasets and dataloaders built.
If load_all_images_to_RAM=True, the size of the dataset is estimated and then the user can choose to load all images to the RAM.
random_transforms, set in # Setting main parameters, controls data augmentation:
- random_transforms='train & val': data augmentation on both train and val phases. The dataset created here is later split into training and validation datasets (and dataloaders).
- random_transforms='train': data augmentation only on the train phase; validation is free of random transforms. The dataset created here includes random transforms, and later separate train/val datasets (and dataloaders) are created with/without them.
- random_transforms='none': no data augmentation. The dataset created here is later split into training and validation datasets (and dataloaders).

n_to_sample_for_data_size_estimation=10 # only if load_all_images_to_RAM=True was set
"""torchvision.transforms.ToTensor() Converts a PIL Image or numpy.ndarray (H x W x C) in the
range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the
range [0.0, 1.0] if the PIL Image belongs to one of the modes
(L, LA, P, I, F, RGB, YCbCr, RGBA, CMYK, 1) or if the numpy.ndarray has
dtype = np.uint8
In the other cases, tensors are returned without scaling
source: https://pytorch.org/docs/stable/torchvision/transforms.html
"""
if random_transforms=='none':
random_transforms_ui=input("random_transforms='none' was set, meaning no data augmentation, approve? [y]/n ")
if random_transforms_ui=='n':
raise RuntimeError('user did not approve no data augmentation, aborting')
if architecture_is_a_pretrained_model:
if net_architecture=='inception v3':
input_size_for_pretrained=299
else:
input_size_for_pretrained=224
transform_func_with_random=torchvision.transforms.Compose([
torchvision.transforms.Resize(input_size_for_pretrained+10),
torchvision.transforms.RandomCrop(input_size_for_pretrained),
torchvision.transforms.ColorJitter(brightness=0.1,contrast=0.1,saturation=0,hue=0),
torchvision.transforms.RandomHorizontalFlip(p=0.5),
torchvision.transforms.ToTensor(),
torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225]), # required for pre-trained torchvision models!
])
transform_func_no_random=torchvision.transforms.Compose([
torchvision.transforms.Resize(input_size_for_pretrained),
torchvision.transforms.ToTensor(),
torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225]), # required for pre-trained torchvision models!
])
else:
transform_func_with_random=torchvision.transforms.Compose([
# torchvision.transforms.Resize(400),
torchvision.transforms.RandomCrop(390),
torchvision.transforms.ColorJitter(brightness=0.1,contrast=0.1,saturation=0,hue=0),
torchvision.transforms.RandomHorizontalFlip(p=0.5),
torchvision.transforms.ToTensor(),
])
transform_func_no_random=torchvision.transforms.Compose([
torchvision.transforms.Resize(390),
torchvision.transforms.ToTensor(),
])
# end of inputs ---------------------------------------------------------------
if random_transforms=='none':
transform_func=transform_func_no_random
else:
transform_func=transform_func_with_random
dataset=DatingAI.unnested_images_dataset(unnested_images_df=unnested_images_df,
images_folder_path=images_folder_path,transform_func=transform_func)
if load_all_images_to_RAM:
# estimating dataset size based on sampled samples(images)
sampled_sample_indices=random.sample(range(len(dataset)),n_to_sample_for_data_size_estimation)
sampled_images_dict_in_RAM=DatingAI.build_images_dict_in_RAM(
image_filenames_list=list(unnested_images_df['image filename'].iloc[sampled_sample_indices]),
images_folder_path=images_folder_path)
image_np_arrays_size_MB=sum([sys.getsizeof(np.array(image)) for image in sampled_images_dict_in_RAM.values()])/1e6
expected_sampled_images_dict_in_RAM_size_MB=image_np_arrays_size_MB/n_to_sample_for_data_size_estimation*len(dataset)
user_decision_RAM=input('load_all_images_to_RAM=True was set, estimated dataset size based on %d random samples: %.1eMB, load all images to RAM? y/[n] '%(
n_to_sample_for_data_size_estimation,expected_sampled_images_dict_in_RAM_size_MB))
if user_decision_RAM=='y':
logger.info('started loading all images to RAM')
images_dict_in_RAM=DatingAI.build_images_dict_in_RAM(
image_filenames_list=list(unnested_images_df['image filename']),
images_folder_path=images_folder_path)
dataset=DatingAI.unnested_images_dataset(
unnested_images_df=unnested_images_df,
images_dict_for_RAM_loading=images_dict_in_RAM,
transform_func=transform_func)
image_np_arrays_size_MB=sum([sys.getsizeof(np.array(image)) for image in images_dict_in_RAM.values()])/1e6
logger.info('completed loading all images to RAM, size: %.1eMB'%image_np_arrays_size_MB)
else:
logger.info('user disapproved loading all dataset to RAM, keeping it on the hard drive and loading with a dataloader')
sample_size=dataset[0]['image'].size()
sample_pixels_per_channel=sample_size[1]*sample_size[2]
sample_pixels_all_channels=sample_size[0]*sample_pixels_per_channel
logger.info('set a PyTorch dataset of length %d, input size (assuming it is constant): (%d,%d,%d)'%(
len(unnested_images_df),sample_size[0],sample_size[1],sample_size[2]))
#sample_indices_to_plot=range(20) # for dataset plotting verification
random.seed(0)
sample_indices_to_plot=random.sample(range(len(unnested_images_df)),20)
images_per_row=5
figure_size=(10,10) # (width,height) in inches
# end of inputs ---------------------------------------------------------------
if debugging:
DatingAI.plot_unnested_images_dataset(sample_indices_to_plot,dataset,
figure_size,images_per_row,
image_format='PIL->torch',normalize=True)
plt.suptitle('plotting from pytorch dataset, 1st time')
if random_transforms!='none':
DatingAI.plot_unnested_images_dataset(sample_indices_to_plot,dataset,
figure_size,images_per_row,
image_format='PIL->torch',normalize=True)
plt.suptitle('plotting from pytorch dataset, 2nd time (to see random transforms)')
DatingAI.plot_unnested_images_df(sample_indices_to_plot,unnested_images_df,
images_folder_path,figure_size,images_per_row)
plt.suptitle('plotting from raw data - unnested_images_df')
target_is_continuous=False # target is not continuous, profile scores are discrete
normalization='over total' # heights=counts/sum(discrete_hist)
opacity=0.6
# end of inputs ---------------------------------------------------------------
# splitting
dataset_length=len(unnested_images_df)
dataset_indices=list(range(dataset_length))
split_index=int((1-validation_ratio)*dataset_length)
if shuffle_dataset_indices_for_split:
np.random.seed(dataset_shuffle_random_seed)
np.random.shuffle(dataset_indices)
train_indices=dataset_indices[:split_index]
val_indices=dataset_indices[split_index:]
dataset_indices={'train':train_indices,'val':val_indices}
logger.info('dataset indices split to training and validation, with validation_ratio=%.1f, lengths: (train,val)=(%d,%d)'%(
validation_ratio,len(train_indices),len(val_indices)))
# plotting target distributions
plt.figure()
for phase in ['train','val']:
targets_list=unnested_images_df.iloc[dataset_indices[phase]]['score'].values
DatingAI.easy_hist(targets_list,distribution_is_continuous=target_is_continuous,
normalization=normalization,label=phase,opacity=opacity)
plt.title('training and validation target distributions')
plt.xlabel('target values')
plt.legend(loc='best');
This figure presents a few important things:
# setting batch size
if isinstance(batch_size_int_or_ratio_float,int):
batch_size=batch_size_int_or_ratio_float
elif isinstance(batch_size_int_or_ratio_float,float):
batch_size=round(batch_size_int_or_ratio_float*dataset_length)
else:
raise RuntimeError('unsupported batch_size input!')
if batch_size<1:
batch_size=1
logger.warning('batch_size=round(batch_size_int_or_ratio_float*dataset_length)<1 so batch_size=1 was set')
if batch_size==1:
user_batch_size=int(input('got batch_size=1, may cause errors, enter a new batch size equal or larger than 1, or smaller than 1 to abort: ')) # input() returns a string, so convert before comparing
if user_batch_size<1:
    raise RuntimeError('aborted by user batch size decision')
else:
    batch_size=user_batch_size
# building datasets
if random_transforms=='train': # means applying random transforms only on train, so separate datasets must be created
if load_all_images_to_RAM and user_decision_RAM=='y':
train_dataset=DatingAI.unnested_images_dataset(
    unnested_images_df=unnested_images_df.iloc[train_indices],
    images_dict_for_RAM_loading=images_dict_in_RAM,
    transform_func=transform_func_with_random)
val_dataset=DatingAI.unnested_images_dataset( # validation gets no random transforms, matching random_transforms='train'
    unnested_images_df=unnested_images_df.iloc[val_indices],
    images_dict_for_RAM_loading=images_dict_in_RAM,
    transform_func=transform_func_no_random)
else:
train_dataset=DatingAI.unnested_images_dataset(
unnested_images_df=unnested_images_df.iloc[train_indices],
images_folder_path=images_folder_path,
transform_func=transform_func_with_random)
val_dataset=DatingAI.unnested_images_dataset(
unnested_images_df=unnested_images_df.iloc[val_indices],
images_folder_path=images_folder_path,
transform_func=transform_func_no_random)
else:
dataset_to_split=dataset
# splitting the dataset to train and val
train_dataset=torch.utils.data.Subset(dataset_to_split,train_indices)
val_dataset=torch.utils.data.Subset(dataset_to_split,val_indices)
# building the train and val dataloaders
train_dataloader=torch.utils.data.DataLoader(train_dataset,batch_size=batch_size,
num_workers=data_workers,shuffle=dataloader_shuffle)
val_dataloader=torch.utils.data.DataLoader(val_dataset,batch_size=batch_size,
num_workers=data_workers,shuffle=dataloader_shuffle)
# structuring
datasets={'train':train_dataset,'val':val_dataset}
dataset_samples_number={'train':len(train_dataset),'val':len(val_dataset)}
dataloaders={'train':train_dataloader,'val':val_dataloader}
dataloader_batches_number={'train':len(train_dataloader),'val':len(val_dataloader)}
logger.info('dataset split to training and validation datasets and dataloaders with validation_ratio=%.1f, lengths: (train,val)=(%d,%d)'%(
validation_ratio,dataset_samples_number['train'],dataset_samples_number['val']))
images_per_row=4
normalize=True # normalizes all pixels in each channel to be in [0,1]. needed for plotting, after the strange normalization that torchvision models require
figure_size=(8,6) # (width,height) in inches
# end of inputs ---------------------------------------------------------------
if debugging:
if __name__=='__main__' or data_workers==0:
for phase in ['train','val']:
batch=next(iter(dataloaders[phase]))
DatingAI.plot_unnested_images_batch(batch,figure_size=figure_size,
images_per_row=images_per_row,normalize=normalize)
plt.suptitle('plotting a batch from the %s dataloader'%phase)
else:
logger.warning('cannot use multiprocessing (data_workers>0 in dataloaders) on Windows when not executed as __main__')
if net_architecture=='my simple CNN':
class my_CNN(nn.Module):
def __init__(self):
super(my_CNN, self).__init__()
self.conv1 = nn.Conv2d(3,6,8,stride=4)
self.pool = nn.MaxPool2d(2,2)
self.conv2 = nn.Conv2d(6,16,8,stride=4)
self.fc1 = nn.Linear(400,120)
self.fc2 = nn.Linear(120,84)
self.fc3 = nn.Linear(84,1)
def forward(self,x):
x=F.relu(self.conv1(x))
x=self.pool(x)
x=F.relu(self.conv2(x))
x=self.pool(x)
x=x.view(-1,np.array(x.shape[1:]).prod()) # don't use x.view(batch_size,-1), which fails for batches smaller than batch_size (at the end of the dataloader)
x=F.relu(self.fc1(x))
x=F.relu(self.fc2(x))
x=self.fc3(x)
return x
model=my_CNN()
parameters_to_optimize=model.parameters()
elif net_architecture=='resnet18':
model=torchvision.models.resnet18(pretrained=True)
if freeze_pretrained_net_weights:
for param in model.parameters():
param.requires_grad=False
parameters_to_optimize=model.fc.parameters()
else:
parameters_to_optimize=model.parameters()
model.fc=nn.Linear(model.fc.in_features,1) # Parameters of newly constructed modules have requires_grad=True by default
elif net_architecture=='inception v3':
model=torchvision.models.inception_v3(pretrained=True)
if freeze_pretrained_net_weights:
for param in model.parameters():
param.requires_grad=False
# Parameters of newly constructed modules have requires_grad=True by default:
model.AuxLogits.fc=nn.Linear(768,1)
model.fc=nn.Linear(2048,1)
if freeze_pretrained_net_weights:
parameters_to_optimize=[]
for name,parameter in model.named_parameters():
if parameter.requires_grad:
parameters_to_optimize.append(parameter)
else:
parameters_to_optimize=model.parameters()
else:
raise RuntimeError('untreated net_architecture!')
model=model.to(device)
total_weights_num=sum(p.numel() for p in model.parameters())
trainable_weights_num=sum(p.numel() for p in model.parameters() if p.requires_grad)
logger.info("set '%s' net on %s, trainable/total weights: %.1e/%.1e"%(
net_architecture,device,trainable_weights_num,total_weights_num))
if loss_name=='MSE':
loss_fn=nn.MSELoss(reduction='mean').to(device)
else:
raise RuntimeError('untreated loss_name input')
if optimizer_name=='SGD':
optimizer=torch.optim.SGD(parameters_to_optimize,lr=learning_rate,momentum=SGD_momentum)
elif optimizer_name=='Adam':
optimizer=torch.optim.Adam(parameters_to_optimize,lr=learning_rate,betas=Adam_betas)
else:
raise RuntimeError('untreated optimizer_name input')
scheduler=torch.optim.lr_scheduler.StepLR(optimizer,
step_size=lr_scheduler_step_size,gamma=lr_scheduler_decay_factor)
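Given the values set in the parameters section (learning_rate=2e-4, decay factor 0.9, step size 1), this StepLR schedule decays the learning rate geometrically; a plain-Python sketch, assuming those values:

```python
# Sketch of the StepLR schedule configured above: every step_size epochs
# the learning rate is multiplied by gamma (a 10% decay per epoch here).
base_lr, gamma, step_size = 2e-4, 0.9, 1
schedule = [base_lr * gamma ** (epoch // step_size) for epoch in range(4)]
print(schedule)  # geometric decay: 2e-4, then ~1.8e-4, ~1.62e-4, ...
```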
# comment last lines from Net.forward() to check outputs of earlier lines
if debugging:
if __name__=='__main__' or data_workers==0:
batch=next(iter(dataloaders['train']))
labels=batch['profile score'].to(device).unsqueeze(1).float()
images=batch['image']
images=images.to(device)
print('batch images.shape:',images.shape)
model.eval()
outputs=model(images)
print('outputs.shape:',outputs.shape)
with torch.set_grad_enabled(False):
MSE=((outputs-labels)**2).mean() # both are (batch,1); flattening only one of them would broadcast to (batch,batch)
sqrt_MSE=(MSE**0.5).item()
print('batch sqrt(MSE):',sqrt_MSE)
else:
logger.warning('cannot use multiprocessing (data_workers>0 in dataloaders) on Windows when not executed as __main__')
def model_evaluation(model,dataloader,loss_fn):
model.eval() # set model to evaluate mode
epoch_loss=0.0 # must be a float
epoch_samples_number=0
label_arrays_list=[]
output_arrays_list=[]
for i_batch,batch in enumerate(dataloader):
images=batch['image'].to(device)
labels=batch['profile score'].to(device).unsqueeze(1).float()
# forward
with torch.set_grad_enabled(False): # gradients are never needed during evaluation
outputs=model(images)
loss=loss_fn(outputs,labels)
# accumulating
samples_number=len(labels)
epoch_samples_number+=samples_number
current_loss=loss.item()*samples_number # the loss is averaged across samples in each minibatch, so it is multiplied to return to a total
epoch_loss+=current_loss
label_arrays_list.append(labels.flatten().cpu().numpy())
output_arrays_list.append(outputs.flatten().cpu().numpy())
# post-processing
epoch_loss_per_sample=epoch_loss/epoch_samples_number # use the local count, not the global dataset_samples_number[phase]
labels_array=np.concatenate(label_arrays_list)
outputs_array=np.concatenate(output_arrays_list)
return labels_array,outputs_array,epoch_loss_per_sample
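The accumulation above multiplies each batch's mean loss by the batch size before dividing by the total sample count. A toy example with hypothetical numbers shows why a plain mean over batch losses would be biased when the final batch is ragged:

```python
# Two batches with per-sample mean losses and unequal sizes (toy values).
batch_mean_losses = [2.0, 4.0]
batch_sizes = [8, 2]  # ragged final batch

naive = sum(batch_mean_losses) / len(batch_mean_losses)  # 3.0, biased
weighted = (sum(l * n for l, n in zip(batch_mean_losses, batch_sizes))
            / sum(batch_sizes))
print(weighted)  # 2.4 -> the true per-sample average over all 10 samples
```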
# if force_train_evaluation_after_each_epoch:
# train_evaluation_ui=input('force_train_evaluation_after_each_epoch=True was set, it is inefficient (but useful for training analysis), approve? y/[n] ')
# if train_evaluation_ui!='y':
# raise RuntimeError('user did not approve force_train_evaluation_after_each_epoch=True, aborting!')
if plot_realtime_stats_on_logging or plot_realtime_stats_after_each_epoch:
logger.warning('plotting from inside the net loop is not working, should be debugged...')
if train_model_else_load_weights and (__name__=='__main__' or data_workers==0):
stats_dict={'train':{'running metrics':{},'epoch total running metrics':{},
'evaluation metrics':{}},
'val':{'evaluation metrics':{}}}
if torch.cuda.is_available():
logger.info('torch is using %s (%s)'%(device,torch.cuda.get_device_name(device=0)))
else:
logger.info('torch is using %s'%(device))
# model pre-training evaluation
logger.info('started model pre-training evaluation')
for phase in ['train','val']:
dataloader=dataloaders[phase]
labels_array,outputs_array,epoch_loss_per_sample=\
model_evaluation(model,dataloaders[phase],loss_fn)
errors_array=labels_array-outputs_array
epoch_MSE=(errors_array**2).mean()
logger.info('(pre-training, %s) loss per sample: %.3e, sqrt(MSE): sqrt(%.3e)=%.3e'%(
phase,epoch_loss_per_sample,epoch_MSE,epoch_MSE**0.5))
stats_dict[phase]['evaluation metrics'].update({0:
{'loss per sample':epoch_loss_per_sample,
'MSE':epoch_MSE}})
total_batches=epochs*(dataloader_batches_number['train']+dataloader_batches_number['val'])
period_already_logged=0
logger.info('started model training')
print('-'*10)
tic=time()
# model training
for epoch in range(epochs):
for phase in ['train','val']:
if phase == 'train':
scheduler.step() # note: since PyTorch 1.1 the recommended order is to call scheduler.step() after the epoch's optimizer steps
model.train() # set model to training mode
else:
model.eval() # set model to evaluate mode
epoch_loss=0.0 # must be a float
epoch_squared_error=0.0
samples_processed_since_last_log=0
loss_since_last_log=0.0 # must be a float
squared_error_since_last_log=0.0
for i_batch,batch in enumerate(dataloaders[phase]):
images=batch['image'].to(device)
labels=batch['profile score'].to(device).unsqueeze(1).float()
optimizer.zero_grad() # zero the parameter gradients
# forward
with torch.set_grad_enabled(phase=='train'): # if phase=='train' it tracks tensor history for grad calc
if net_architecture=='inception v3' and phase=='train':
outputs,aux_outputs=model(images)
loss1=loss_fn(outputs,labels)
loss2=loss_fn(aux_outputs,labels)
loss=loss1+0.4*loss2 # in train mode it has an auxiliary output (to deal with gradient decay); see https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html
# loss=loss1 # in train mode it has an auxiliary output (to deal with gradient decay); see https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html
else:
outputs=model(images)
loss=loss_fn(outputs,labels)
if torch.isnan(loss):
raise RuntimeError('reached NaN loss - aborting training!')
# backward + optimize if training
if phase=='train':
loss.backward()
optimizer.step()
# accumulating stats
samples_number=len(labels)
samples_processed_since_last_log+=samples_number
current_loss=loss.item()*samples_number # the loss is averaged across samples in each minibatch, so it is multiplied to return to a total
epoch_loss+=current_loss
loss_since_last_log+=current_loss
with torch.set_grad_enabled(False):
batch_squared_error=((outputs-labels)**2).sum().item()
epoch_squared_error+=batch_squared_error
squared_error_since_last_log+=batch_squared_error
# logging running stats
if phase=='train' and period_in_seconds_to_log_loss>0:
passed_seconds=time()-tic
period=passed_seconds//period_in_seconds_to_log_loss
if period>period_already_logged:
period_already_logged=period
loss_since_last_log_per_sample=loss_since_last_log/samples_processed_since_last_log
MSE_since_last_log=squared_error_since_last_log/samples_processed_since_last_log
completed_batches=epoch*(dataloader_batches_number['train']+dataloader_batches_number['val'])+(i_batch+1)
completed_batches_progress=completed_batches/total_batches
logger.info('(epoch %d/%d, batch %d/%d, %s, running) loss per sample : %.3e, sqrt(MSE): sqrt(%.3e)=%.3e'%(
epoch+1,epochs,i_batch+1,dataloader_batches_number[phase],phase,
loss_since_last_log_per_sample,
MSE_since_last_log,MSE_since_last_log**0.5))
partial_epoch=epoch+completed_batches_progress
stats_dict[phase]['running metrics'].update({partial_epoch:
{'batch':i_batch+1,'loss per sample':loss_since_last_log_per_sample,
'MSE':MSE_since_last_log}})
loss_since_last_log=0.0 # must be a float
squared_error_since_last_log=0.0
samples_processed_since_last_log=0
# accumulating epoch stats
epoch_loss_per_sample=epoch_loss/dataset_samples_number[phase]
epoch_MSE=epoch_squared_error/dataset_samples_number[phase]
if phase=='train': # saving running stats
stats_dict[phase]['epoch total running metrics'].update({epoch+1:
{'loss per sample':epoch_loss_per_sample,
'MSE':epoch_MSE}})
if force_train_evaluation_after_each_epoch: # train dataloader evaluation
labels_array,outputs_array,epoch_loss_per_sample=\
model_evaluation(model,dataloaders[phase],loss_fn)
errors_array=labels_array-outputs_array
epoch_MSE=(errors_array**2).mean()
stats_dict[phase]['evaluation metrics'].update({epoch+1:
{'loss per sample':epoch_loss_per_sample,
'MSE':epoch_MSE}})
else: # val dataloader evaluation
stats_dict[phase]['evaluation metrics'].update({epoch+1:
{'loss per sample':epoch_loss_per_sample,
'MSE':epoch_MSE}})
if phase=='val': # updating best model results
if best_model_criterion=='min val epoch MSE':
best_criterion_current_value=epoch_MSE
if epoch==0:
best_criterion_best_value=best_criterion_current_value
best_model_wts=copy.deepcopy(model.state_dict())
best_epoch=epoch
else:
if best_criterion_current_value<best_criterion_best_value:
best_criterion_best_value=best_criterion_current_value
best_model_wts=copy.deepcopy(model.state_dict())
best_epoch=epoch
# logging evaluation stats
if phase=='val' or force_train_evaluation_after_each_epoch:
completed_epochs_progress=(epoch+1)/epochs
passed_seconds=time()-tic
expected_seconds=passed_seconds/completed_epochs_progress*(1-completed_epochs_progress)
expected_remainder_time=DatingAI.remainder_time(expected_seconds)
logger.info('(epoch %d/%d, %s, evaluation) epoch loss per sample: %.3e, epoch sqrt(MSE): sqrt(%.3e)=%.3e\n\tProgress: %.2f%%, ETA: %dh:%dm:%.0fs'%(
epoch+1,epochs,phase,
epoch_loss_per_sample,
epoch_MSE,epoch_MSE**0.5,
100*completed_epochs_progress,
expected_remainder_time.hours,
expected_remainder_time.remainder_minutes,
expected_remainder_time.remainder_seconds))
print('-'*10)
toc=time()
elapsed_sec=toc-tic
logger.info('finished training %d epochs in %dm:%.1fs'%(
epochs,elapsed_sec//60,elapsed_sec%60))
if return_to_best_weights_in_the_end:
model.load_state_dict(best_model_wts)
logger.info("loaded weights of best model according to '%s' criterion: best value %.3f achieved in epoch %d"%(
best_model_criterion,best_criterion_best_value,best_epoch+1))
else: # train_model_else_load_weights==False
model_name_ui=input('model weights file name to load: ')
model_weights_file_path=os.path.join(models_folder_path,model_name_ui)
if not os.path.isfile(model_weights_file_path):
raise RuntimeError('%s does not exist!'%model_weights_file_path)
model_weights=torch.load(model_weights_file_path)
model.load_state_dict(model_weights)
logger.info('model weights from %s were loaded'%model_weights_file_path)
I started this project for interest, challenge, and experience. "Great" results are not among the goals, and are not expected, since:
# plot_running_stats=True
plot_running_stats=False
plot_epoch_total_running_stats=True
# plot_epoch_total_running_stats=False
plot_loss_in_log_scale=False
# plot_loss_in_log_scale=True
figure_size=(10,4) # (width,height) in inches
# end of inputs ---------------------------------------------------------------
logger.info("remember that even if loss_name='MSE', the loss may include regularization or auxiliary terms (as in inception v3) and therefore may not equal MSE!")
if not (plot_realtime_stats_on_logging or plot_realtime_stats_after_each_epoch):
fig=plt.figure(figsize=figure_size)
plt.suptitle('training stats')
loss_subplot=plt.subplot(1,2,1)
MSE_subplot=plt.subplot(1,2,2)
DatingAI.training_stats_plot(stats_dict,fig,loss_subplot,MSE_subplot,plot_loss_in_log_scale,
plot_running_stats,plot_epoch_total_running_stats)
Since return_to_best_weights_in_the_end=True and best_model_criterion='min val epoch MSE' were set, after training completed the model was loaded with the weights from the epoch that achieved the minimal validation MSE. As logged (and shown in the training stats figure), the best criterion value was achieved in epoch 3. The model with those weights is evaluated here:
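For reference, `model_evaluation` (defined earlier in the notebook) is assumed to behave roughly as in the minimal sketch below: run the model in eval mode over the dataloader, accumulate the loss, and return the labels, outputs, and the loss per sample. The function name and signature match the calls in this notebook, but the body is an illustrative assumption, not the notebook's exact implementation:

```python
import torch

def model_evaluation_sketch(model, dataloader, loss_fn, device='cpu'):
    """Assumed behavior of model_evaluation: evaluate the model over a
    dataloader and return (labels_array, outputs_array, loss_per_sample)."""
    model.eval()  # disable dropout/batch-norm updates
    labels_list, outputs_list = [], []
    total_loss, total_samples = 0.0, 0
    with torch.no_grad():  # no gradients needed for evaluation
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs).squeeze(-1)  # regression: one score per image
            loss = loss_fn(outputs, labels.float())
            # assuming a mean-reduced loss, weight it back by the batch size
            total_loss += loss.item() * inputs.size(0)
            total_samples += inputs.size(0)
            labels_list.append(labels.cpu())
            outputs_list.append(outputs.cpu())
    labels_array = torch.cat(labels_list).numpy()
    outputs_array = torch.cat(outputs_list).numpy()
    return labels_array, outputs_array, total_loss / total_samples
```

Returning flat NumPy arrays matches how the notebook then computes `errors_array=labels_array-outputs_array` and the MSE from it.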
distribution_is_continuous=True # the errors are continuous since the predictions are continuous, even though the targets (profile scores) are discrete
normalization='over total' # heights=counts/sum(discrete_hist)
opacity=0.6
bins=50
# end of inputs ---------------------------------------------------------------
logger.info('started model evaluation')
plt.figure()
for phase in ['train','val']:
dataloader=dataloaders[phase]
labels_array,outputs_array,epoch_loss_per_sample=\
model_evaluation(model,dataloader,loss_fn)
errors_array=labels_array-outputs_array
DatingAI.easy_hist(errors_array,distribution_is_continuous=distribution_is_continuous,
bins=bins,normalization=normalization,label=phase,opacity=opacity)
epoch_MSE=(errors_array**2).mean()
logger.info('(post-training, %s) loss per sample: %.3e, sqrt(MSE): sqrt(%.3e)=%.3e'%(
phase,epoch_loss_per_sample,epoch_MSE,epoch_MSE**0.5))
logger.info('completed model evaluation')
plt.title('training and validation error distributions')
plt.xlabel('errors (targets-predictions)')
plt.legend(loc='best');
if offer_mode_saving and train_model_else_load_weights:
os.makedirs(models_folder_path,exist_ok=True) # create the folder if it does not exist yet
saving_decision=input('save model weights? [y]/n ')
if saving_decision!='n':
model_name_ui=input('name model weights file: ')
model_weights_file_path=os.path.join(models_folder_path,model_name_ui+'.ptweights')
if os.path.isfile(model_weights_file_path):
alternative_filename=input('%s already exists; enter a different file name to save as, re-enter the same name to over-write, or hit enter to abort: '%model_weights_file_path)
if alternative_filename=='':
raise RuntimeError('aborted by user')
else:
model_weights_file_path=os.path.join(models_folder_path,alternative_filename+'.ptweights')
torch.save(model.state_dict(),model_weights_file_path)
logger.info('%s saved'%model_weights_file_path)
logger.info('script completed')